Detailed Design of Chubby: Part IV

Learn about fault-tolerance, scalability, and availability of Chubby's design.

Failovers#

Nodes can fail, and we need an effective strategy to minimize downtime. One way to reduce downtime is fast failover. Let’s discuss the failover scenario and how our system handles such cases.

A primary replica discards its state about sessions, handles, and locks when it fails or loses leadership. This must be followed by the election of a new primary replica, with two possible outcomes:

  1. If a primary replica gets elected rapidly, clients can connect with the new primary replica before their own approximation of the primary replica lease’s (local lease) duration runs out.
  2. If the election extends for a longer duration, clients discover the new primary replica by emptying their caches and waiting for the grace period. As a result, our system can keep sessions running during failovers that go past the typical lease time-out, thanks to the grace period.
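The two options above amount to a small client-side state machine. The following is only an illustrative sketch, not Chubby's actual client library; the class, method names, and durations are assumptions.

```python
import time

# Illustrative sketch of a client's lease/grace-period handling during a
# failover (names and durations are assumptions, not the real library's API).
class ClientSession:
    def __init__(self, local_lease_s=12.0, grace_period_s=45.0):
        self.local_lease_s = local_lease_s    # client's conservative lease approximation
        self.grace_period_s = grace_period_s  # extra wait after the local lease expires
        self.lease_expiry = time.monotonic() + local_lease_s
        self.in_jeopardy = False

    def on_keepalive_reply(self, lease_extension_s):
        # A reply from a (new or old) primary extends the local lease (option 1).
        self.lease_expiry = time.monotonic() + lease_extension_s
        self.in_jeopardy = False

    def on_lease_expired(self, cache):
        # Option 2: the local lease ran out before a new primary answered.
        cache.clear()             # empty the cache
        self.in_jeopardy = True   # start the grace-period countdown
        # Deadline by which a new primary must be found to keep the session.
        return time.monotonic() + self.grace_period_s
```

If a KeepAlive reply arrives before the grace-period deadline, the session survives the failover; otherwise the client must give up the session.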

Once a client has made contact with the new primary replica, the client library and the primary replica work together to give the application the impression that nothing went wrong. The new primary replica must recreate the in-memory state of the primary replica that it replaced. It accomplishes this in part by:

  • Reading data that has been duplicated via the standard database replication technique and kept safely on disk
  • Acquiring state from clients
  • Making cautious assumptions

The new primary replica recreates an approximation of the in-memory state of the old primary replica

Note: All sessions, held locks, and ephemeral files are recorded in the database.

Newly elected primary replica proceedings#

The newly elected primary replica proceeds through the following steps:

  1. It selects a fresh client epoch number.
  2. It initially blocks any incoming session-related tasks.
  3. It starts entertaining KeepAlive requests.
  4. It creates in-memory data structures of all the relevant data.
  5. It starts entertaining all the requests.
  6. It asks all the clients to flush their caches to avoid discrepancies.
  7. It makes sure that all the clients acknowledge the failover.
  8. It manages its in-memory representation of all the client handles.
  9. It cleans up its space by deleting ephemeral files whose handles have been closed.
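The proceedings above can be sketched as a startup routine. This is only an illustrative sketch under assumed names: the class, the database interface, and the bookkeeping are not Chubby's actual implementation.

```python
# Illustrative sketch of a newly elected primary's startup steps
# (class, method names, and the database interface are assumptions).
class NewPrimary:
    def __init__(self, database):
        self.db = database
        self.epoch = None
        self.allow_keepalives = False
        self.accepting_sessions = False
        self.sessions = {}
        self.handles = {}

    def start(self):
        self.epoch = self.db.next_epoch()        # pick a fresh epoch number
        self.accepting_sessions = False          # block session-related tasks at first
        self.allow_keepalives = True             # answer KeepAlives before anything else
        # Rebuild in-memory state from the replicated, durably stored database:
        # all sessions, held locks, and ephemeral files are recorded there.
        self.sessions = dict(self.db.read_sessions())
        self.handles = dict(self.db.read_handles())
        self.accepting_sessions = True           # now entertain all requests
        return self.epoch
```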

Point to ponder

Question

Why does the client library work so hard to keep the old session going? Why not simply create a new session once we can contact the new primary replica?

Answer

Establishing a new session involves many preparatory steps, and we want sessions to survive a failover. The system therefore mimics an uninterrupted session by relying on the already-established connection and stored information. Switching to a new session would force us to repeat most of that earlier work, and in practice we want to keep such overhead to a minimum.

Example scenario#

The following illustration depicts the progression of an extended primary replica failover event in which the client must use its grace period to keep its session alive. Time increases from top to bottom, but the diagram is not drawn to scale.

Note: Arrows from left to right show KeepAlive requests, and the ones from right to left show their replies.

  1. The client has a conservative approximation, C1, of the lease, whereas the original primary replica holds a session lease M1 for it.
  2. The client can extend its view of the lease to C2 once the primary replica commits to lease M2 and notifies it via KeepAlive reply 2.
  3. The primary replica dies before responding to the following KeepAlive, and some time passes before a new primary replica is chosen.
  4. The client’s approximate lease, C2, eventually expires. After flushing its cache, the client begins the grace-period countdown.
  5. At the start of the grace period, the Chubby library sends a jeopardy event to the application so that it can stay quiet until it is sure of the state of its session.
  6. A fresh primary replica election eventually succeeds. The new primary replica initially uses a conservative estimate, M3, of whatever session leases the client may have had with its predecessor.
  7. The client’s first KeepAlive request to the new primary replica is declined because it contains the wrong primary replica epoch number.
  8. Because M3 was conservative, the retried request (6) succeeds without extending the primary replica’s lease. The reply (7), however, enables the client to renew its lease to C3 and, if desired, notifies the application that its session is no longer in jeopardy.

Note: During the grace period, the client is uncertain whether its lease at the primary replica has expired. It does not end its session, but it blocks all API calls from applications to prevent them from observing inconsistent data.
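The epoch check in the scenario above can be sketched as follows. The wire format and names here are assumptions for illustration; the point is that a stale epoch number gets one rejection carrying the new epoch, after which the retry succeeds.

```python
# Sketch of the primary-replica epoch check during failover
# (message fields and names are illustrative assumptions).
class Primary:
    def __init__(self, epoch):
        self.epoch = epoch

    def keepalive(self, client_epoch, lease_s=12.0):
        if client_epoch != self.epoch:
            # The first KeepAlive after a failover carries the old epoch:
            # reject it, but tell the client the new epoch so it can retry.
            return {"ok": False, "epoch": self.epoch}
        return {"ok": True, "epoch": self.epoch, "lease_s": lease_s}

def client_keepalive(primary, known_epoch):
    reply = primary.keepalive(known_epoch)
    if not reply["ok"]:
        # Retry with the epoch number the new primary just told us about.
        reply = primary.keepalive(reply["epoch"])
    return reply
```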

Point to ponder

Question

How did Google implement Chubby’s database?

Answer

Chubby’s initial iteration used a replicated version of Berkeley DB as its database, with some modifications made in successive iterations to better fit Google’s use case. Google’s implementation maps onto Berkeley DB’s design as follows:

  • Berkeley DB offers B-trees that map byte-string keys to byte-string values. Based on this, Google implemented a key-comparison function that sorts path names first by the number of components in each path. This keeps siblings adjacent in the sort order while allowing nodes to be keyed by their path names.

    Note: Chubby does not employ path-based permissions. Therefore, each file access only requires a single database query.

  • Berkeley DB replicates its database logs over several servers using a distributed consensus approach. This matched Chubby’s design once Chubby introduced primary replica leases, which simplified the implementation.

  • The replication code for Berkeley DB had only recently been introduced at the time of Chubby’s release, whereas the B-tree code was mature and widely used. Because the newly exposed replication code carried more risk, Google built a straightforward database of its own using write-ahead logging and snapshotting.

As discussed previously, a distributed consensus protocol was used to disseminate the database log across the replicas. Chubby only utilized a small portion of Berkeley DB’s functionality. Therefore, Google’s rewriting allowed for a significant system-wide simplification. For instance, Google required atomic operations but did not require general transactions.
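The key-comparison idea in the first point above can be sketched as follows. The exact comparator is not public, so this is an assumption: sort first by component count, then lexicographically, which keeps all siblings of a directory adjacent.

```python
from functools import cmp_to_key

def path_key_cmp(a: str, b: str) -> int:
    # Sort path names first by number of components, then lexicographically,
    # so the children of any directory end up adjacent in the B-tree.
    ca, cb = a.count("/"), b.count("/")
    if ca != cb:
        return -1 if ca < cb else 1
    return -1 if a < b else (1 if a > b else 0)

paths = ["/ls/cell/dir/file2", "/ls/cell", "/ls/cell/dir", "/ls/cell/dir/file1"]
ordered = sorted(paths, key=cmp_to_key(path_key_cmp))
# The siblings /ls/cell/dir/file1 and /ls/cell/dir/file2 stay adjacent.
```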

Backup#

Each Chubby cell’s primary replica takes a snapshot of its database every few hours and uploads it to a GFS file server in a separate building. Using a different site guarantees both that the backup will survive damage at a single site and that backups do not introduce cyclic dependencies into the system; otherwise, a GFS cell deployed at a single site might depend on the Chubby cell for electing its primary replica.

Chubby's backup strategy

Backups offer both disaster recovery and a way to set up the database of a replacement replica without putting additional stress on active replica servers.

Mirroring#

Chubby enables mirroring a group of files from one cell to another. Mirroring is fast because the files are small and the event system promptly notifies the mirroring code whenever a file is created, removed, or updated. Assuming no network issues, changes are reflected in dozens of mirrors worldwide in under a second. A mirror that cannot be reached remains unchanged until communication is reestablished; updated files are then located by comparing their checksums.

Mirroring a Chubby cell across five different locations
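The checksum-based catch-up for a lagging mirror can be sketched as follows. The real mirroring code is not public; the function names, hash choice, and in-memory file maps here are assumptions.

```python
import hashlib

def checksum(data: bytes) -> str:
    # Any stable content hash works for comparing file versions.
    return hashlib.sha256(data).hexdigest()

def files_to_update(source: dict, mirror: dict) -> list:
    # Return the paths whose content is missing or differs on the mirror,
    # found by contrasting checksums after communication is reestablished.
    stale = []
    for path, data in source.items():
        if path not in mirror or checksum(mirror[path]) != checksum(data):
            stale.append(path)
    return sorted(stale)

source = {"/ls/global/acl": b"v2", "/ls/global/cfg": b"v1"}
mirror = {"/ls/global/acl": b"v1", "/ls/global/cfg": b"v1"}
# files_to_update(source, mirror) -> ["/ls/global/acl"]
```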

Most frequently, configuration files are copied to numerous computing clusters dispersed worldwide via mirroring. A unique cell called global has a subtree called /ls/global/primary_replica that is mirrored to every other Chubby cell’s subtree called /ls/cell/shadow. The global cell is unique because it can nearly always be reached by the organization owing to its five replica servers spread across five different geographical locations.

Some of the files mirrored from the global cell are listed below:

  • Chubby’s ACLs
  • The files that Chubby cells and other systems use to let the monitoring services know of their presence
  • Pointers to large datasets like Bigtable cells (these pointers help clients find these datasets)
  • Configuration files for other systems

Scalability#

Google observed 90,000 clients communicating directly with a Chubby primary replica, which is significantly more than the number of machines involved. Because Chubby’s clients are independent processes, Chubby must manage more clients than one might anticipate: each machine can run many processes that talk to Chubby. There is only one primary replica per cell, and its machine is of the same type as the clients’ machines, so clients can vastly outnumber it. Therefore, the most effective scaling strategies are those that significantly reduce communication with the primary replica.

We can utilize several mechanisms to scale Chubby. A few of them are as follows:

  1. Minimizing round-trip time (RTT): Any number of Chubby cells can be created; clients usually use a local cell (located with DNS) to avoid relying on distant computers. For a data center with several thousand computers, our standard setup requires one Chubby cell.
  2. Minimizing KeepAlive loads: When under a substantial load, the primary replica may raise lease times from the standard 12 seconds to around 60 seconds so that it processes fewer KeepAlive RPCs. KeepAlives are by far the most common request, and failing to process them promptly is the typical failure mode of an overloaded server; clients are largely oblivious to latency variations in other calls.
  3. Optimizing caching: To minimize the number of calls Chubby clients make to the server, they cache file data, metadata, the absence of files, and currently open handles.
  4. Protocol conversions: We use protocol-conversion servers to convert the sophisticated Chubby protocol into simpler protocols such as DNS.

We’ll review proxies and partitioning, two further scaling mechanisms, below.

Scaling mechanisms for Chubby
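The lease-time adaptation in mechanism 2 can be sketched as a simple policy. The thresholds and the linear ramp below are assumptions; only the 12-second and 60-second endpoints come from the text.

```python
# Sketch of raising lease times under load so the primary processes fewer
# KeepAlive RPCs. The load threshold and linear ramp are assumptions.
def keepalive_lease_seconds(active_sessions: int,
                            normal_s: float = 12.0,
                            max_s: float = 60.0,
                            heavy_load: int = 50_000) -> float:
    if active_sessions >= heavy_load:
        return max_s  # longest lease: fewest KeepAlives per client
    # Scale linearly between the normal and the maximum lease time.
    return normal_s + (max_s - normal_s) * active_sessions / heavy_load
```

A longer lease means each client sends KeepAlives less often, directly cutting the dominant request type at the primary.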

Proxies#

Trusted processes that relay requests from other clients to a Chubby cell can proxy Chubby’s protocol (using the same protocol on both sides). A proxy can handle both KeepAlive and read requests to lessen server load, but it cannot reduce write traffic, which passes through the proxy’s cache unmodified. Because write traffic is well under 1% of Chubby’s typical workload even with aggressive client caching, proxy servers enable a considerable increase in the number of clients.

Proxies can positively impact different requests in the following ways:

  • KeepAlive traffic is decreased by a factor of N_proxy if a proxy manages N_proxy clients, which might be 10,000 or more per cell.
  • At best, a proxy cache can reduce read traffic by a factor of 10 or by the average amount of read-sharing.

However, as reads only make up around 10% of Chubby’s current load, the reduction in KeepAlive traffic is far more significant (in one of Chubby’s installations at Google, they observed 1% write traffic, 10% read traffic, and the rest for the other operations related to sessions, etc.).
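As a rough worked example of the traffic mix quoted above, we can estimate the fraction of raw client load that still reaches the primary replica. The shares and the tenfold read reduction are the figures from the text; treating everything other than reads and writes as KeepAlive-like session traffic is an assumption.

```python
# Back-of-the-envelope load reduction from a proxy, using the traffic mix
# quoted above (1% writes, 10% reads; the rest treated as session traffic).
def primary_load_fraction(n_proxy: int, read_reduction: float = 10.0) -> float:
    write_share, read_share = 0.01, 0.10
    session_share = 1.0 - write_share - read_share  # KeepAlives etc. dominate
    return (write_share                     # writes pass through unchanged
            + read_share / read_reduction   # proxy cache absorbs most reads
            + session_share / n_proxy)      # KeepAlives fan in by N_proxy

# With 10,000 clients per proxy, the primary sees only about 2% of the raw
# load, almost all of it writes and cache-missing reads.
```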

Scaling Chubby using proxies
Scaling Chubby using proxies

Point to ponder

Question

Are there any associated drawbacks of using proxies?

Answer

Proxies add an extra RPC for writes and first-time reads. For these requests, each proxied client now depends on two servers: its proxy and Chubby’s primary replica. Since either can fail, clients can expect to experience a temporarily failed service at least twice as often as before.

Partitioning#

We can shard Chubby’s namespace for scalability. With sharding, a Chubby cell consists of N partitions, each with its own set of replicas, one of which is a primary replica.

Let’s assume that we have a directory D, and there are one or more child leaf nodes C forming names of the form D/C. With partitioning enabled, the partition P(D/C) = hash(D) mod N contains all of the nodes D/C in the directory D. Note that the metadata for the directory D itself may live on a different partition: if D' is the parent of D, then the partition P(D) = hash(D') mod N stores the metadata for D.
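The partitioning rule can be sketched directly. The choice of CRC32 and the partition count are assumptions; any stable hash of the directory name gives the same property, namely that all children of a directory land in the same partition.

```python
from zlib import crc32

N = 5  # number of partitions in the cell (illustrative)

def partition_for(path: str, n: int = N) -> int:
    # P(D/C) = hash(D) mod N: hash the parent directory D of a node D/C.
    directory = path.rsplit("/", 1)[0]
    return crc32(directory.encode()) % n

# All children of the same directory map to the same partition:
# partition_for("/ls/cell/dir/a") == partition_for("/ls/cell/dir/b")
```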

Partitioning in Chubby was done to facilitate large Chubby cells with minimal communication between the partitions. However, that’s not always the case: despite the absence of hard links, directory modification times, and cross-directory rename operations in Chubby, several operations still call for cross-partition communication. Here are a few sample scenarios:

  • Most clients read publicly available files that do not require an ACL check. ACLs are themselves files and are easily cached; however, one partition may need to consult another to verify permissions, especially for Open() and Delete() calls, which require an ACL check.
  • When a directory is removed, a cross-partition call may be required to verify that the directory is empty.

Note: Since most calls are handled individually by each system partition, we anticipate that this communication will only have a minor effect on performance or availability.

Unless N is very large, one would anticipate that each client will contact most of the partitions. Consequently, partitioning decreases read and write traffic on any given partition by a factor of N, but it does not necessarily reduce KeepAlive traffic because ongoing sessions may span partitions.

Note: We can use a combination of proxies and partitioning should Chubby need to handle more clients.

This marks the end of our discussion on design. The next lesson will discuss the rationale behind various design decisions.
